Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service translates into losses for the bank, so the bank wants to analyze its customer data to identify which customers are likely to leave the service and why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
churn = pd.read_csv("/content/drive/MyDrive/Python Course/BankChurners.csv")
# Copy the data
data = churn.copy()
# View the first 5 rows
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# View the last 5 rows
data.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
data.shape
(10127, 21)
There are 10127 rows and 21 columns.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
# Check for duplicate values in the data
data.duplicated().sum()
0
There are no duplicates in the data.
# Check for missing values in the data
data.isnull().sum()
| 0 | |
|---|---|
| CLIENTNUM | 0 |
| Attrition_Flag | 0 |
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 1519 |
| Marital_Status | 749 |
| Income_Category | 0 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
data.describe(include=["object"]).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
# View unique values in the categorical columns
for i in data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
Unique values in Attrition_Flag are : Attrition_Flag Existing Customer 8500 Attrited Customer 1627 Name: count, dtype: int64 ************************************************** Unique values in Gender are : Gender F 5358 M 4769 Name: count, dtype: int64 ************************************************** Unique values in Education_Level are : Education_Level Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: count, dtype: int64 ************************************************** Unique values in Marital_Status are : Marital_Status Married 4687 Single 3943 Divorced 748 Name: count, dtype: int64 ************************************************** Unique values in Income_Category are : Income_Category Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: count, dtype: int64 ************************************************** Unique values in Card_Category are : Card_Category Blue 9436 Silver 555 Gold 116 Platinum 20 Name: count, dtype: int64 **************************************************
# CLIENTNUM will not add value to the modeling since it consists of unique IDs for clients
# So, we will drop it
data.drop(["CLIENTNUM"], axis=1, inplace=True)
# Encode Existing and Attrited customers as 0 and 1, respectively
data["Attrition_Flag"] = data["Attrition_Flag"].replace(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram (palette dropped: it is ignored without a hue variable)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # x-coordinate: center of the bar
y = p.get_height() # y-coordinate: top of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
Customer_Age
histogram_boxplot(data, "Customer_Age", kde=True)
Months_on_book
histogram_boxplot(data, "Months_on_book", kde=True)
Credit_Limit
histogram_boxplot(data, "Credit_Limit", kde=True)
Total_Revolving_Bal
histogram_boxplot(data, "Total_Revolving_Bal", kde=True)
Avg_Open_To_Buy
histogram_boxplot(data, "Avg_Open_To_Buy", kde=True)
Total_Trans_Ct
histogram_boxplot(data, "Total_Trans_Ct", kde=True)
Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1", kde=True)
Total_Trans_Amt
histogram_boxplot(data, "Total_Trans_Amt", kde=True)
Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1", kde=True)
Avg_Utilization_Ratio
histogram_boxplot(data, "Avg_Utilization_Ratio", kde=True)
Dependent_count
labeled_barplot(data, "Dependent_count", True)
Total_Relationship_Count
labeled_barplot(data, 'Total_Relationship_Count', True)
Months_Inactive_12_mon
labeled_barplot(data, 'Months_Inactive_12_mon', True)
Contacts_Count_12_mon
labeled_barplot(data, 'Contacts_Count_12_mon', True)
Gender
labeled_barplot(data, 'Gender', True)
Education_Level
labeled_barplot(data, 'Education_Level', True)
Marital_Status
labeled_barplot(data, 'Marital_Status', True)
Income_Category
labeled_barplot(data, 'Income_Category', True)
Card_Category
labeled_barplot(data, 'Card_Category', True)
Attrition_Flag
labeled_barplot(data, 'Attrition_Flag', perc=True)
# Create a pairplot
sns.pairplot(data,hue='Attrition_Flag') #, diag_kind = 'hist'
plt.show()
Check for attributes that have a strong correlation with each other
Correlation Check
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Attrition_Flag vs Gender
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag 0 1 All Gender All 8500 1627 10127 F 4428 930 5358 M 4072 697 4769 ------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Marital_Status
stacked_barplot(data,"Marital_Status", "Attrition_Flag")
Attrition_Flag 0 1 All Marital_Status All 7880 1498 9378 Married 3978 709 4687 Single 3275 668 3943 Divorced 627 121 748 ------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Education_Level
stacked_barplot(data,"Education_Level", "Attrition_Flag")
Attrition_Flag 0 1 All Education_Level All 7237 1371 8608 Graduate 2641 487 3128 High School 1707 306 2013 Uneducated 1250 237 1487 College 859 154 1013 Doctorate 356 95 451 Post-Graduate 424 92 516 ------------------------------------------------------------------------------------------------------------------------
Customers with Doctorate degrees attrited the most at ~21%; followed by customers with Post-Graduate degrees at ~18%. Customers with College, Graduate, High School degrees and Uneducated customers attrited at ~15-16%.
Attrition_Flag vs Income_Category
stacked_barplot(data,"Income_Category", "Attrition_Flag")
Attrition_Flag 0 1 All Income_Category All 8500 1627 10127 Less than $40K 2949 612 3561 $40K - $60K 1519 271 1790 $80K - $120K 1293 242 1535 $60K - $80K 1213 189 1402 abc 925 187 1112 $120K + 601 126 727 ------------------------------------------------------------------------------------------------------------------------
Customers making over $120K, customers making less than $40K, and those with an unknown income attrited the most, at ~17%. Customers making $60K - $80K attrited the least, at ~13.5%, while customers making $80K - $120K had ~16% attrition.
Attrition_Flag vs Contacts_Count_12_mon
stacked_barplot(data,"Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag 0 1 All Contacts_Count_12_mon All 8500 1627 10127 3 2699 681 3380 2 2824 403 3227 4 1077 315 1392 1 1391 108 1499 5 117 59 176 6 0 54 54 0 392 7 399 ------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Months_Inactive_12_mon
stacked_barplot(data,"Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag 0 1 All Months_Inactive_12_mon All 8500 1627 10127 3 3020 826 3846 2 2777 505 3282 4 305 130 435 1 2133 100 2233 5 146 32 178 6 105 19 124 0 14 15 29 ------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Total_Relationship_Count
stacked_barplot(data,"Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag 0 1 All Total_Relationship_Count All 8500 1627 10127 3 1905 400 2305 2 897 346 1243 1 677 233 910 5 1664 227 1891 4 1687 225 1912 6 1670 196 1866 ------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Dependent_count
stacked_barplot(data,"Dependent_count", "Attrition_Flag")
Attrition_Flag 0 1 All Dependent_count All 8500 1627 10127 3 2250 482 2732 2 2238 417 2655 1 1569 269 1838 4 1314 260 1574 0 769 135 904 5 360 64 424 ------------------------------------------------------------------------------------------------------------------------
Attrition_Flag vs Credit_Limit
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")
Attrition_Flag vs Customer_Age
distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag")
Attrition_Flag vs Total_Trans_Ct
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")
Attrition_Flag vs Total_Trans_Amt
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")
Attrition_Flag vs Total_Ct_Chng_Q4_Q1
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
Attrition_Flag vs Total_Amt_Chng_Q4_Q1
distribution_plot_wrt_target(data, "Total_Amt_Chng_Q4_Q1", "Attrition_Flag")
Attrition_Flag vs Avg_Utilization_Ratio
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")
Attrition_Flag vs Months_on_book
distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")
Attrition_Flag vs Total_Revolving_Bal
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")
Attrition_Flag vs Avg_Open_To_Buy
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")
Questions:
How does the quarter-over-quarter change in transaction amount (Total_Amt_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
How does the number of inactive months in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
# Find the IQR bounds to see how many outliers are in the data
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25) # To find the 25th percentile and 75th percentile.
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)
IQR = Q3 - Q1 # Interquartile Range (75th percentile - 25th percentile)
lower = (
Q1 - 1.5 * IQR
) # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
# Determine the percentage of outliers in each numeric column
(
(data.select_dtypes(include=["float64", "int64"]) < lower)
| (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
| 0 | |
|---|---|
| Attrition_Flag | 16.066 |
| Customer_Age | 0.020 |
| Dependent_count | 0.000 |
| Months_on_book | 3.812 |
| Total_Relationship_Count | 0.000 |
| Months_Inactive_12_mon | 3.268 |
| Contacts_Count_12_mon | 6.211 |
| Credit_Limit | 9.717 |
| Total_Revolving_Bal | 0.000 |
| Avg_Open_To_Buy | 9.509 |
| Total_Amt_Chng_Q4_Q1 | 3.910 |
| Total_Trans_Amt | 8.848 |
| Total_Trans_Ct | 0.020 |
| Total_Ct_Chng_Q4_Q1 | 3.891 |
| Avg_Utilization_Ratio | 0.000 |
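As a sanity check on the table above, the IQR rule can be traced on a small toy series (values invented for illustration, not taken from the bank data). Note also that the 16.066% flagged for Attrition_Flag simply reflects the minority class of a binary target column, not genuine outliers.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 14, 50])  # 50 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 25th and 75th percentiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # here: 7.625 and 16.625

# Values outside [lower, upper] are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [50]
```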
# Create a copy of the dataframe
data1 = data.copy()
# Replace "bad" data with Null values for further processing
data1["Income_Category"] = data1["Income_Category"].replace("abc", np.nan)
# Check for Null Values
data1.isna().sum()
| 0 | |
|---|---|
| Attrition_Flag | 0 |
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 1519 |
| Marital_Status | 749 |
| Income_Category | 1112 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
# Dividing train data into X and y
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]
# Split data into training, validation and test set:
# Using a 70/15/15 split since the full dataset is not very large
# Split the data into train-temp in the ratio 70:30
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
# Split the temp data into validation-test in the ratio 50:50
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1, stratify=y_temp)
print(X_train.shape, X_val.shape, X_test.shape)
(7088, 19) (1520, 19) (1519, 19)
# Create a list of columns with missing data
reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# Create an instance of the imputer to be used
# This imputer replaces null values with the most frequent value in each column
imputer = SimpleImputer(strategy="most_frequent")
# Replace the data
# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])
# Transform the validation data
X_val[reqd_col_for_impute] = imputer.transform(X_val[reqd_col_for_impute])
# Transform the test data
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])
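The fit-on-train, transform-on-validation-and-test discipline above matters: the imputation value must come from the training data only, otherwise information leaks from the held-out sets. A toy sketch of the same pattern (made-up values, not the bank data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"Marital_Status": ["Married", "Married", "Single", np.nan]})
test = pd.DataFrame({"Marital_Status": [np.nan, "Divorced"]})

imp = SimpleImputer(strategy="most_frequent")
# Learn the mode ("Married") from the training data only
train[["Marital_Status"]] = imp.fit_transform(train[["Marital_Status"]])
# Apply the training-set mode to the test data
test[["Marital_Status"]] = imp.transform(test[["Marital_Status"]])

print(train["Marital_Status"].tolist())  # ['Married', 'Married', 'Single', 'Married']
print(test["Marital_Status"].tolist())   # ['Married', 'Divorced']
```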
# Check the class balance in the train, validation, and test sets
print("Percentage of classes in training set:")
print(100*y_train.value_counts(normalize=True), '\n')
print("Percentage of classes in val set:")
print(100*y_val.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(100*y_test.value_counts(normalize=True))
Percentage of classes in training set: Attrition_Flag 0 83.931 1 16.069 Name: proportion, dtype: float64 Percentage of classes in val set: Attrition_Flag 0 83.947 1 16.053 Name: proportion, dtype: float64 Percentage of classes in test set: Attrition_Flag 0 83.937 1 16.063 Name: proportion, dtype: float64
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_train[i].value_counts())
print("*" * 30)
Gender F 3770 M 3318 Name: count, dtype: int64 ****************************** Education_Level Graduate 3247 High School 1425 Uneducated 1031 College 709 Post-Graduate 364 Doctorate 312 Name: count, dtype: int64 ****************************** Marital_Status Married 3815 Single 2771 Divorced 502 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 3273 $40K - $60K 1254 $80K - $120K 1084 $60K - $80K 974 $120K + 503 Name: count, dtype: int64 ****************************** Card_Category Blue 6621 Silver 375 Gold 78 Platinum 14 Name: count, dtype: int64 ******************************
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_val[i].value_counts())
print("*" * 30)
Gender F 803 M 717 Name: count, dtype: int64 ****************************** Education_Level Graduate 714 High School 295 Uneducated 221 College 147 Post-Graduate 77 Doctorate 66 Name: count, dtype: int64 ****************************** Marital_Status Married 812 Single 596 Divorced 112 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 721 $40K - $60K 265 $60K - $80K 220 $80K - $120K 216 $120K + 98 Name: count, dtype: int64 ****************************** Card_Category Blue 1402 Silver 98 Gold 15 Platinum 5 Name: count, dtype: int64 ******************************
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_test[i].value_counts())
print("*" * 30)
Gender F 785 M 734 Name: count, dtype: int64 ****************************** Education_Level Graduate 686 High School 293 Uneducated 235 College 157 Post-Graduate 75 Doctorate 73 Name: count, dtype: int64 ****************************** Marital_Status Married 809 Single 576 Divorced 134 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 679 $40K - $60K 271 $80K - $120K 235 $60K - $80K 208 $120K + 126 Name: count, dtype: int64 ****************************** Card_Category Blue 1413 Silver 82 Gold 23 Platinum 1 Name: count, dtype: int64 ******************************
# Encode data (One-Hot)
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(7088, 29) (1520, 29) (1519, 29)
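Here the shapes match (29 columns each) because every category level happens to appear in all three splits, but `pd.get_dummies` applied separately to each split can produce misaligned columns when a rare level (e.g. Platinum, with only 20 rows overall) is absent from one split. A defensive pattern is to reindex the other splits to the training columns — sketched below on toy data, not the bank dataframe:

```python
import pandas as pd

train = pd.DataFrame({"card": ["Blue", "Silver", "Gold"]})
test = pd.DataFrame({"card": ["Blue", "Blue"]})  # 'Silver'/'Gold' absent

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)
print(test_d.shape)  # (2, 0): 'Blue' was dropped and no other level exists

# Align test columns to the training columns, filling missing dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))  # True
```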
# check the top 5 rows from the train dataset
X_train.head()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4124 | 50 | 1 | 43 | 6 | 1 | 2 | 7985.000 | 0 | 7985.000 | 1.032 | 3873 | 72 | 0.674 | 0.000 | False | False | True | False | False | False | True | False | False | False | False | True | False | False | False |
| 4686 | 50 | 0 | 36 | 3 | 3 | 2 | 5444.000 | 2499 | 2945.000 | 0.468 | 4509 | 80 | 0.667 | 0.459 | True | False | True | False | False | False | False | False | False | True | False | False | False | False | False |
| 1276 | 26 | 0 | 13 | 6 | 3 | 4 | 1643.000 | 1101 | 542.000 | 0.713 | 2152 | 50 | 0.471 | 0.670 | False | False | True | False | False | False | False | True | True | False | False | False | False | False | False |
| 6119 | 65 | 0 | 55 | 3 | 3 | 0 | 2022.000 | 0 | 2022.000 | 0.579 | 4623 | 65 | 0.548 | 0.000 | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False |
| 2253 | 46 | 3 | 35 | 6 | 3 | 4 | 4930.000 | 0 | 4930.000 | 1.019 | 3343 | 77 | 0.638 | 0.000 | True | False | True | False | False | False | False | True | False | False | True | False | False | False | False |
The nature of predictions made by the classification model will translate as follows:
True positives (TP) are attrited customers correctly predicted as attrited by the model.
False negatives (FN) are attrited customers incorrectly predicted as existing customers — the bank loses a customer it could have tried to retain.
False positives (FP) are existing customers incorrectly predicted as attrited — the bank spends retention effort on a customer who was going to stay.
True negatives (TN) are existing customers correctly predicted as existing.
Which metric to optimize?
My model evaluation criterion: maximize recall. A false negative (predicting a customer will stay when they are actually about to attrite) costs the bank the chance to intervene and retain that customer, which is more expensive than a false positive.
Let's define a function to output different metrics (including recall) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
NOTE:
I am going to use a two-pronged approach to build and evaluate models: first compare several tree-based classifiers with default settings on the original, oversampled, and undersampled training data, and then tune the most promising candidates.
models = []  # empty list to store all the models

# Appending models into the list
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))

print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Decision Tree: 1.0
Bagging: 0.9850746268656716
Random forest: 1.0
AdaBoost: 0.8410886742756805
Gradient Boosting: 0.8902546093064091
XGBoost: 1.0

Validation Performance:

Decision Tree: 0.8483606557377049
Bagging: 0.8442622950819673
Random forest: 0.8155737704918032
AdaBoost: 0.8401639344262295
Gradient Boosting: 0.8688524590163934
XGBoost: 0.9139344262295082
# Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)
# Calculating different metrics
dtree_model_train_perf=model_performance_classification_sklearn(d_tree,X_train,y_train)
print("Training performance:\n",dtree_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree, X_train, y_train)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
dtree_model_test_perf=model_performance_classification_sklearn(d_tree,X_val,y_val)
print("Validating performance:\n",dtree_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.947 0.848 0.825 0.836
# Fitting the model
bag = BaggingClassifier(random_state=1)
bag.fit(X_train,y_train)
# Calculating different metrics
bag_model_train_perf=model_performance_classification_sklearn(bag,X_train,y_train)
print("Training performance:\n",bag_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bag, X_train, y_train)
Training performance:
Accuracy Recall Precision F1
0 0.997 0.985 0.996 0.990
# Validation data
bag_model_test_perf=model_performance_classification_sklearn(bag,X_val,y_val)
print("Validating performance:\n",bag_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bag, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.959 0.844 0.892 0.867
# Fitting the model
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
# Calculating different metrics
rf_model_train_perf=model_performance_classification_sklearn(rf,X_train,y_train)
print("Training performance:\n",rf_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf, X_train, y_train)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
rf_model_test_perf=model_performance_classification_sklearn(rf,X_val,y_val)
print("Validating performance:\n",rf_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.957 0.816 0.905 0.858
# Fitting the model
adb = AdaBoostClassifier(random_state=1)
adb.fit(X_train,y_train)
# Calculating different metrics
adb_model_train_perf=model_performance_classification_sklearn(adb,X_train,y_train)
print("Training performance:\n",adb_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(adb, X_train, y_train)
Training performance:
Accuracy Recall Precision F1
0 0.957 0.841 0.886 0.863
# Validation data
adb_model_test_perf=model_performance_classification_sklearn(adb,X_val,y_val)
print("Validating performance:\n",adb_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(adb, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.953 0.840 0.861 0.851
# Fitting the model
gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train,y_train)
# Calculating different metrics
gb_model_train_perf=model_performance_classification_sklearn(gb,X_train,y_train)
print("Training performance:\n",gb_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb, X_train, y_train)
Training performance:
Accuracy Recall Precision F1
0 0.976 0.890 0.957 0.922
# Validation data
gb_model_test_perf=model_performance_classification_sklearn(gb,X_val,y_val)
print("Validating performance:\n",gb_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.966 0.869 0.914 0.891
# Fitting the model
xgb = XGBClassifier(random_state=1)
xgb.fit(X_train,y_train)
# Calculating different metrics
xgb_model_train_perf=model_performance_classification_sklearn(xgb,X_train,y_train)
print("Training performance:\n",xgb_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb, X_train, y_train)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
xgb_model_test_perf=model_performance_classification_sklearn(xgb,X_val,y_val)
print("Validating performance:\n",xgb_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.972 0.914 0.914 0.914
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 1139
Before Oversampling, counts of label 'No': 5949

After Oversampling, counts of label 'Yes': 5949
After Oversampling, counts of label 'No': 5949

After Oversampling, the shape of train_X: (11898, 29)
After Oversampling, the shape of train_y: (11898,)
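For intuition, SMOTE creates each synthetic minority point by interpolating between an existing minority sample and one of its k nearest minority-class neighbours. A simplified sketch of that single step (the imbalanced-learn implementation selects neighbours via k-NN and generates many samples at once):

```python
import numpy as np

rng = np.random.default_rng(1)

def smote_point(x, neighbor):
    """Synthesize one point on the segment between a minority sample
    and one of its nearest minority-class neighbors."""
    gap = rng.random()               # random position in [0, 1)
    return x + gap * (neighbor - x)  # linear interpolation

x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])
synthetic = smote_point(x, neighbor)
# the synthetic point always lies on the segment between the two inputs
print(synthetic)
```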
models = []  # empty list to store all the models

# Appending models into the list
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))

print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Decision Tree: 1.0
Bagging: 0.998150949739452
Random forest: 1.0
AdaBoost: 0.9695747184400739
Gradient Boosting: 0.979660447133972
XGBoost: 1.0

Validation Performance:

Decision Tree: 0.8811475409836066
Bagging: 0.9016393442622951
Random forest: 0.8647540983606558
AdaBoost: 0.8852459016393442
Gradient Boosting: 0.9057377049180327
XGBoost: 0.9180327868852459
# Fitting the model
d_tree_over = DecisionTreeClassifier(random_state=1)
d_tree_over.fit(X_train_over,y_train_over)
# Calculating different metrics
dtree_over_model_train_perf=model_performance_classification_sklearn(d_tree_over,X_train_over,y_train_over)
print("Training performance:\n",dtree_over_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree_over, X_train_over, y_train_over)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
dtree_over_model_test_perf=model_performance_classification_sklearn(d_tree_over,X_val,y_val)
print("Validating performance:\n",dtree_over_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree_over, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.939 0.881 0.773 0.824
# Fitting the model
bag_over = BaggingClassifier(random_state=1)
bag_over.fit(X_train_over,y_train_over)
# Calculating different metrics
bag_over_model_train_perf=model_performance_classification_sklearn(bag_over,X_train_over,y_train_over)
print("Training performance:\n",bag_over_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bag_over, X_train_over, y_train_over)
Training performance:
Accuracy Recall Precision F1
0 0.999 0.998 0.999 0.999
# Validation data
bag_over_model_test_perf=model_performance_classification_sklearn(bag_over,X_val,y_val)
print("Validating performance:\n",bag_over_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bag_over, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.947 0.902 0.797 0.846
# Fitting the model
rf_over = RandomForestClassifier(random_state=1)
rf_over.fit(X_train_over,y_train_over)
# Calculating different metrics
rf_over_model_train_perf = model_performance_classification_sklearn(rf_over, X_train_over, y_train_over)
print("Training performance:\n",rf_over_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf_over, X_train_over, y_train_over)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
rf_over_model_test_perf=model_performance_classification_sklearn(rf_over,X_val,y_val)
print("Validating performance:\n",rf_over_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf_over, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.949 0.865 0.824 0.844
# Fitting the model
adb_over = AdaBoostClassifier(random_state=1)
adb_over.fit(X_train_over,y_train_over)
# Calculating different metrics
adb_over_model_train_perf=model_performance_classification_sklearn(adb_over,X_train_over,y_train_over)
print("Training performance:\n",adb_over_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(adb_over, X_train_over, y_train_over)
Training performance:
Accuracy Recall Precision F1
0 0.964 0.970 0.960 0.965
# Validation data
adb_over_model_test_perf=model_performance_classification_sklearn(adb_over,X_val,y_val)
print("Validating performance:\n",adb_over_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(adb_over, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.938 0.885 0.766 0.821
# Fitting the model
gb_over = GradientBoostingClassifier(random_state=1)
gb_over.fit(X_train_over,y_train_over)
# Calculating different metrics
gb_over_model_train_perf=model_performance_classification_sklearn(gb_over,X_train_over,y_train_over)
print("Training performance:\n",gb_over_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb_over, X_train_over, y_train_over)
Training performance:
Accuracy Recall Precision F1
0 0.977 0.980 0.975 0.977
# Validation data
gb_over_model_test_perf=model_performance_classification_sklearn(gb_over,X_val,y_val)
print("Validating performance:\n",gb_over_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb_over, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.951 0.906 0.810 0.855
# Fitting the model
xgb_over = XGBClassifier(random_state=1)
xgb_over.fit(X_train_over,y_train_over)
# Calculating different metrics
xgb_over_model_train_perf=model_performance_classification_sklearn(xgb_over,X_train_over,y_train_over)
print("Training performance:\n",xgb_over_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb_over, X_train_over, y_train_over)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
xgb_over_model_test_perf=model_performance_classification_sklearn(xgb_over,X_val,y_val)
print("Validating performance:\n",xgb_over_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb_over, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.968 0.918 0.885 0.901
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 1139
Before Under Sampling, counts of label 'No': 5949

After Under Sampling, counts of label 'Yes': 1139
After Under Sampling, counts of label 'No': 1139

After Under Sampling, the shape of train_X: (2278, 29)
After Under Sampling, the shape of train_y: (2278,)
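Conceptually, RandomUnderSampler keeps all minority rows and draws, without replacement, an equally sized random subset of the majority rows. A minimal sketch with pandas (the toy data and the column name `target` are illustrative):

```python
import pandas as pd

# toy imbalanced data: 8 majority (0) rows, 2 minority (1) rows
df = pd.DataFrame({"feature": range(10),
                   "target":  [0] * 8 + [1] * 2})

minority = df[df["target"] == 1]
# sample as many majority rows as there are minority rows
majority = df[df["target"] == 0].sample(n=len(minority), random_state=1)

balanced = pd.concat([majority, minority])
print(balanced["target"].value_counts())  # 2 rows of each class
```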
models = []  # empty list to store all the models

# Appending models into the list
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))

print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Decision Tree: 1.0
Bagging: 0.9885864793678666
Random forest: 1.0
AdaBoost: 0.9446883230904302
Gradient Boosting: 0.9754170324846356
XGBoost: 1.0

Validation Performance:

Decision Tree: 0.8975409836065574
Bagging: 0.9385245901639344
Random forest: 0.9672131147540983
AdaBoost: 0.9426229508196722
Gradient Boosting: 0.9672131147540983
XGBoost: 0.9631147540983607
# Fitting the model
d_tree_un = DecisionTreeClassifier(random_state=1)
d_tree_un.fit(X_train_un,y_train_un)
# Calculating different metrics
dtree_un_model_train_perf=model_performance_classification_sklearn(d_tree_un,X_train_un,y_train_un)
print("Training performance:\n",dtree_un_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree_un, X_train_un, y_train_un)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
dtree_un_model_test_perf=model_performance_classification_sklearn(d_tree_un,X_val,y_val)
print("Validating performance:\n",dtree_un_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(d_tree_un, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.886 0.898 0.597 0.717
# Fitting the model
bag_un = BaggingClassifier(random_state=1)
bag_un.fit(X_train_un,y_train_un)
# Calculating different metrics
bag_un_model_train_perf=model_performance_classification_sklearn(bag_un,X_train_un,y_train_un)
print("Training performance:\n",bag_un_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bag_un, X_train_un, y_train_un)
Training performance:
Accuracy Recall Precision F1
0 0.993 0.989 0.997 0.993
# Validation data
bag_un_model_test_perf=model_performance_classification_sklearn(bag_un,X_val,y_val)
print("Validating performance:\n",bag_un_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(bag_un, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.925 0.939 0.698 0.801
# Fitting the model
rf_un = RandomForestClassifier(random_state=1)
rf_un.fit(X_train_un,y_train_un)
# Calculating different metrics
rf_un_model_train_perf = model_performance_classification_sklearn(rf_un, X_train_un, y_train_un)
print("Training performance:\n",rf_un_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf_un, X_train_un, y_train_un)
Training performance:
Accuracy Recall Precision F1
0 0.959 1.000 0.795 0.886
# Validation data
rf_un_model_test_perf=model_performance_classification_sklearn(rf_un,X_val,y_val)
print("Validating performance:\n",rf_un_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(rf_un, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.930 0.967 0.707 0.817
# Fitting the model
adb_un = AdaBoostClassifier(random_state=1)
adb_un.fit(X_train_un,y_train_un)
# Calculating different metrics
adb_un_model_train_perf=model_performance_classification_sklearn(adb_un,X_train_un,y_train_un)
print("Training performance:\n",adb_un_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(adb_un, X_train_un, y_train_un)
Training performance:
Accuracy Recall Precision F1
0 0.942 0.945 0.940 0.942
# Validation data
adb_un_model_test_perf=model_performance_classification_sklearn(adb_un,X_val,y_val)
print("Validating performance:\n",adb_un_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(adb_un, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.934 0.943 0.726 0.820
# Fitting the model
gb_un = GradientBoostingClassifier(random_state=1)
gb_un.fit(X_train_un,y_train_un)
# Calculating different metrics
gb_un_model_train_perf=model_performance_classification_sklearn(gb_un,X_train_un,y_train_un)
print("Training performance:\n",gb_un_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb_un, X_train_un, y_train_un)
Training performance:
Accuracy Recall Precision F1
0 0.973 0.975 0.971 0.973
# Validation data
gb_un_model_test_perf=model_performance_classification_sklearn(gb_un,X_val,y_val)
print("Validating performance:\n",gb_un_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(gb_un, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.941 0.967 0.742 0.840
# Fitting the model
xgb_un = XGBClassifier(random_state=1)
xgb_un.fit(X_train_un,y_train_un)
# Calculating different metrics
xgb_un_model_train_perf=model_performance_classification_sklearn(xgb_un,X_train_un,y_train_un)
print("Training performance:\n",xgb_un_model_train_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb_un, X_train_un, y_train_un)
Training performance:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
# Validation data
xgb_un_model_test_perf=model_performance_classification_sklearn(xgb_un,X_val,y_val)
print("Validating performance:\n",xgb_un_model_test_perf)
# Creating confusion matrix
confusion_matrix_sklearn(xgb_un, X_val, y_val)
Validating performance:
Accuracy Recall Precision F1
0 0.940 0.963 0.741 0.838
Note: The parameter grids below define the hyperparameter search spaces for Gradient Boosting, AdaBoost, Bagging, Random Forest, Decision Tree, and XGBoost, respectively.
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
param_grid = {
'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10, 15],
'min_impurity_decrease': [0.0001,0.001]
}
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
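One grid entry worth explaining: `scale_pos_weight` re-weights the positive (attrition) class in XGBoost's loss. A common starting value is the majority-to-minority class ratio of the training data; with the class counts reported earlier in this notebook, that ratio is roughly 5, which motivates including 5 in the grid.

```python
# class counts taken from the oversampling cell above
neg, pos = 5949, 1139
ratio = neg / pos
print(round(ratio, 2))  # ~5.22, close to the grid value scale_pos_weight=5
```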
%%time
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.9306476543782363:
CPU times: user 1.45 s, sys: 194 ms, total: 1.65 s
Wall time: 21.2 s
tuned_xgb = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.7,
scale_pos_weight=5,
n_estimators=100,
learning_rate=0.1,
gamma=3,
)
tuned_xgb.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.1, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
modelName = "Tuned XGBoost model with original data"
scores = recall_score(y_train, tuned_xgb.predict(X_train))
print("{}: {} : {}".format(modelName, "train", scores))
scores = recall_score(y_val, tuned_xgb.predict(X_val))
print("{}: {} : {}".format(modelName, "val", scores))
Tuned XGBoost model with original data: train : 0.9991220368744512
Tuned XGBoost model with original data: val : 0.9549180327868853
print(modelName, "\n")
tuned_xgb_model_train_perf=model_performance_classification_sklearn(tuned_xgb,X_train,y_train)
print("Training performance:\n",tuned_xgb_model_train_perf)
tuned_xgb_model_val_perf=model_performance_classification_sklearn(tuned_xgb,X_val,y_val)
print("Validating performance:\n",tuned_xgb_model_val_perf)
Tuned XGBoost model with original data
Training performance:
Accuracy Recall Precision F1
0 0.988 0.999 0.933 0.965
Validating performance:
Accuracy Recall Precision F1
0 0.963 0.955 0.838 0.893
# defining model
Model1 = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model1, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)} with CV score=0.7963250637607233:
tuned_gb = GradientBoostingClassifier(
    random_state=1,
    init=DecisionTreeClassifier(random_state=1),
    subsample=0.7,
    max_features=0.7,
    n_estimators=100,
    learning_rate=0.05,
)
tuned_gb.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
init=DecisionTreeClassifier(random_state=1),
interaction_constraints=None, learning_rate=0.05, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_features=0.7,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, ...)
modelName1 = "Tuned Gradient Boosting model with Original data"
scores = recall_score(y_train, tuned_gb.predict(X_train))
print("{}: {}: {}".format(modelName1, "train",scores))
scores = recall_score(y_val, tuned_gb.predict(X_val))
print("{}: {}: {}".format(modelName1, "val", scores))
Tuned Gradient Boosting model with Original data: train: 0.9464442493415277
Tuned Gradient Boosting model with Original data: val: 0.8852459016393442
print(modelName1, "\n")
tuned_gb_model_train_perf=model_performance_classification_sklearn(tuned_gb,X_train,y_train)
print("Training performance:\n",tuned_gb_model_train_perf)
tuned_gb_model_val_perf=model_performance_classification_sklearn(tuned_gb,X_val,y_val)
print("Validating performance:\n",tuned_gb_model_val_perf)
Tuned Gradient Boosting model with Original data
Training performance:
Accuracy Recall Precision F1
0 0.987 0.946 0.971 0.959
Validating performance:
Accuracy Recall Precision F1
0 0.968 0.885 0.911 0.898
# defining model
Model_over = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model_over, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9899156836830612:
tuned_xgb_over = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.7,
scale_pos_weight=5,
n_estimators=75,
learning_rate=0.05,
gamma=3,
)
tuned_xgb_over.fit(X_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric='logloss',
feature_types=None, gamma=3, grow_policy=None,
importance_type=None, interaction_constraints=None,
learning_rate=0.05, max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
max_leaves=None, min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None, n_estimators=75,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
modelName_over = "Tuned XGBoost model with OverSample data"
scores = recall_score(y_train_over, tuned_xgb_over.predict(X_train_over))
print("{}: {}: {}".format(modelName_over, "train", scores))
scores = recall_score(y_val, tuned_xgb_over.predict(X_val))
print("{}: {}: {}".format(modelName_over, "val", scores))
Tuned XGBoost model with OverSample data: train: 0.9993276180870735
Tuned XGBoost model with OverSample data: val: 0.9549180327868853
print(modelName_over, "\n")
tuned_xgb_over_model_train_perf=model_performance_classification_sklearn(tuned_xgb_over,X_train,y_train)
print("Training performance:\n",tuned_xgb_over_model_train_perf)
tuned_xgb_over_model_val_perf=model_performance_classification_sklearn(tuned_xgb_over,X_val,y_val)
print("Validating performance:\n",tuned_xgb_over_model_val_perf)
Tuned XGBoost model with OverSample data
Training performance:
Accuracy Recall Precision F1
0 0.948 0.997 0.755 0.859
Validating performance:
Accuracy Recall Precision F1
0 0.922 0.955 0.685 0.798
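The helper `model_performance_classification_sklearn` is defined earlier in the notebook; a minimal sketch consistent with the one-row tables it prints here (Accuracy, Recall, Precision, F1) could look like this:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def model_performance_classification_sklearn(model, predictors, target):
    """Return Accuracy, Recall, Precision, and F1 as a one-row DataFrame."""
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )
```

This is a reconstruction for readability, not the notebook's exact definition; the actual helper runs earlier in the file.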
# defining model
Model_under = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model_under, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9762964680423526:
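For context on the selected `scale_pos_weight=5`: XGBoost's documentation suggests the ratio of negative to positive training samples as a starting point for this parameter. A quick illustrative calculation (class counts here are made up, not the notebook's actual data):

```python
import numpy as np

# Illustrative class counts: 85% retained (0), 15% attrited (1)
y = np.array([0] * 850 + [1] * 150)

# Conventional starting point for XGBoost's scale_pos_weight:
# (count of negative samples) / (count of positive samples)
neg, pos = np.bincount(y)
scale_pos_weight = neg / pos
print(round(scale_pos_weight, 2))  # 5.67
```

On the already-balanced undersampled set, values above 1 further bias the model toward the positive (attrition) class, which is consistent with the recall-focused scorer used in the search.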
tuned_xgb_un = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.7,
scale_pos_weight=5,
n_estimators=75,
learning_rate=0.05,
gamma=3,
)
tuned_xgb_un.fit(X_train_un, y_train_un)
XGBClassifier(eval_metric='logloss', gamma=3, learning_rate=0.05,
              n_estimators=75, random_state=1, ...)
modelName_un = "Tuned XGBoost model with UnderSample data"
scores = recall_score(y_train_un, tuned_xgb_un.predict(X_train_un))
print("{}: {}: {}".format(modelName_un, "train", scores))
scores = recall_score(y_val, tuned_xgb_un.predict(X_val))
print("{}: {}: {}".format(modelName_un, "val", scores))
Tuned XGBoost model with UnderSample data: train: 1.0
Tuned XGBoost model with UnderSample data: val: 0.9836065573770492
print(modelName_un, "\n")
tuned_xgb_un_model_train_perf=model_performance_classification_sklearn(tuned_xgb_un,X_train_un,y_train_un)
print("Training performance:\n",tuned_xgb_un_model_train_perf)
tuned_xgb_un_model_val_perf=model_performance_classification_sklearn(tuned_xgb_un,X_val,y_val)
print("Validating performance:\n",tuned_xgb_un_model_val_perf)
Tuned XGBoost model with UnderSample data
Training performance:
Accuracy Recall Precision F1
0 0.960 1.000 0.927 0.962
Validating performance:
Accuracy Recall Precision F1
0 0.880 0.984 0.573 0.724
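`X_train_un` and `y_train_un` were built earlier in the notebook; the effect of random undersampling can be sketched with plain NumPy (illustrative data; imbalanced-learn's `RandomUnderSampler` does the equivalent):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative imbalanced data: 90 majority (class 0) vs 10 minority (class 1)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Random undersampling by hand: keep every minority row plus an equal-sized
# random sample of majority rows
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=minority.size, replace=False)
keep = np.concatenate([majority, minority])
X_un, y_un = X[keep], y[keep]
print(np.bincount(y_un))  # [10 10]
```

The price of this balance is discarding most majority-class rows, which explains the high recall but low precision the undersampled models show on validation data.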
# defining model
Model_under1 = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model_under1, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)} with CV score=0.9016461859494551:
# GradientBoostingClassifier, not XGBClassifier: `init` and `max_features`
# come from the Gradient Boosting search above and are not XGBoost parameters
tuned_gb_un = GradientBoostingClassifier(
    random_state=1,
    init=DecisionTreeClassifier(random_state=1),
    subsample=0.7,
    max_features=0.7,
    n_estimators=100,
    learning_rate=0.05,
)
tuned_gb_un.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=DecisionTreeClassifier(random_state=1),
                           learning_rate=0.05, max_features=0.7,
                           n_estimators=100, random_state=1, subsample=0.7)
modelName_un = "Tuned Gradient Boosting model with UnderSample data"
scores = recall_score(y_train_un, tuned_gb_un.predict(X_train_un))
print("{}: {}: {}".format(modelName_un, "train",scores))
scores = recall_score(y_val, tuned_gb_un.predict(X_val))
print("{}: {}: {}".format(modelName_un, "val", scores))
Tuned Gradient Boosting model with UnderSample data: train: 0.9938542581211589
Tuned Gradient Boosting model with UnderSample data: val: 0.9631147540983607
print(modelName_un, "\n")
tuned_gb_un_model_train_perf=model_performance_classification_sklearn(tuned_gb_un,X_train_un,y_train_un)
print("Training performance:\n",tuned_gb_un_model_train_perf)
tuned_gb_un_model_val_perf=model_performance_classification_sklearn(tuned_gb_un,X_val,y_val)
print("Validating performance:\n",tuned_gb_un_model_val_perf)
Tuned Gradient Boosting model with UnderSample data
Training performance:
Accuracy Recall Precision F1
0 0.990 0.994 0.986 0.990
Validating performance:
Accuracy Recall Precision F1
0 0.936 0.963 0.725 0.827
# training performance comparison
models_train_comp_df = pd.concat(
[
tuned_xgb_model_train_perf.T,
tuned_gb_model_train_perf.T,
tuned_xgb_over_model_train_perf.T,
tuned_xgb_un_model_train_perf.T,
tuned_gb_un_model_train_perf.T
],
axis=1,
)
models_train_comp_df.columns = [
"Tuned XGBoost trained with Original data",
"Tuned Gradient Boosting trained with Original data",
"Tuned XGBoost trained with Oversampled data",
"Tuned XGBoost trained with Undersampled data",
"Tuned Gradient Boosting trained with Undersampled data"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Tuned XGBoost trained with Original data | Tuned Gradient Boosting trained with Original data | Tuned XGBoost trained with Oversampled data | Tuned XGBoost trained with Undersampled data | Tuned Gradient Boosting trained with Undersampled data |
|---|---|---|---|---|---|
| Accuracy | 0.988 | 0.987 | 0.948 | 0.960 | 0.990 |
| Recall | 0.999 | 0.946 | 0.997 | 1.000 | 0.994 |
| Precision | 0.933 | 0.971 | 0.755 | 0.927 | 0.986 |
| F1 | 0.965 | 0.959 | 0.859 | 0.962 | 0.990 |
# validation performance comparison
models_val_comp_df = pd.concat(
[
tuned_xgb_model_val_perf.T,
tuned_gb_model_val_perf.T,
tuned_xgb_over_model_val_perf.T,
tuned_xgb_un_model_val_perf.T,
tuned_gb_un_model_val_perf.T
],
axis=1,
)
models_val_comp_df.columns = [
"Tuned XGBoost validated with Original data",
"Tuned Gradient Boosting validated with Original data",
"Tuned XGBoost validated with Oversampled data",
"Tuned XGBoost validated with Undersampled data",
"Tuned Gradient Boosting validated with Undersampled data"
]
print("Validating performance comparison:")
models_val_comp_df
Validating performance comparison:
| | Tuned XGBoost validated with Original data | Tuned Gradient Boosting validated with Original data | Tuned XGBoost validated with Oversampled data | Tuned XGBoost validated with Undersampled data | Tuned Gradient Boosting validated with Undersampled data |
|---|---|---|---|---|---|
| Accuracy | 0.963 | 0.968 | 0.922 | 0.880 | 0.936 |
| Recall | 0.955 | 0.885 | 0.955 | 0.984 | 0.963 |
| Precision | 0.838 | 0.911 | 0.685 | 0.573 | 0.725 |
| F1 | 0.893 | 0.898 | 0.798 | 0.724 | 0.827 |
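With the comparison tables assembled, the trade-off can be made explicit by ranking candidates on the validation metric that matters most here (recall, since missing an attriting customer is the costly error). An illustrative sketch using a slice of the table above:

```python
import pandas as pd

# Illustrative slice of the validation comparison (metrics as rows,
# models as columns, matching models_val_comp_df's layout)
comp = pd.DataFrame(
    {
        "Tuned XGBoost validated with Original data": [0.963, 0.955, 0.838, 0.893],
        "Tuned XGBoost validated with Undersampled data": [0.880, 0.984, 0.573, 0.724],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# Rank candidates by validation recall, highest first
ranked = comp.T.sort_values("Recall", ascending=False)
print(ranked.index[0])  # Tuned XGBoost validated with Undersampled data
```

The same one-liner applied to the full `models_val_comp_df` reproduces the ordering discussed below.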
I will now run the two models I selected, one for analysis and one for production use, on the test data: the tuned XGBoost trained on undersampled data and the tuned XGBoost trained on the original data.
final_test_perf=model_performance_classification_sklearn(tuned_xgb_un,X_test,y_test)
print("Final Test performance:\n",final_test_perf)
Final Test performance:
Accuracy Recall Precision F1
0 0.891 0.980 0.598 0.742
final_test_perf1=model_performance_classification_sklearn(tuned_xgb,X_test,y_test)
print("Final Test performance:\n",final_test_perf1)
Final Test performance:
Accuracy Recall Precision F1
0 0.970 0.934 0.884 0.908
feature_names = X_train.columns
importances = tuned_xgb_un.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
feature_names = X_train.columns
importances = tuned_xgb.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
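Beyond the bar charts, the same `feature_importances_` array can be ranked as a table, which is easier to carry into a report. A small sketch (the `top_features` helper and the feature names are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical helper: rank a fitted model's feature importances as a table
def top_features(feature_names, importances, n=5):
    return (
        pd.Series(importances, index=feature_names)
        .sort_values(ascending=False)
        .head(n)
    )

ranked = top_features(["a", "b", "c"], np.array([0.2, 0.5, 0.3]), n=2)
print(ranked.index.tolist())  # ['b', 'c']
```

In the notebook this would be called as `top_features(X_train.columns, tuned_xgb.feature_importances_)`.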
NOTE:
from sklearn.ensemble import StackingClassifier
estimators = [
    ("Tuned XGBoost With UnderSampled Data", tuned_xgb_un),
    ("Tuned Gradient Boosting With Original Data", tuned_gb),
    ("Tuned XGBoost With OverSampling", tuned_xgb_over),
    ("Tuned Gradient Boosting With Undersampled Data", tuned_gb_un),
]
final_estimator = tuned_xgb
stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)
stacking_classifier.fit(X_train,y_train)
StackingClassifier(estimators=[('Tuned XGBoost With UnderSampled Data',
                                XGBClassifier(...)),
                               ...],
                   final_estimator=XGBClassifier(learning_rate=0.1,
                                                 n_estimators=100,
                                                 random_state=1, ...))
# Calculating Performance metrics
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)
stacking_classifier_model_val_perf=model_performance_classification_sklearn(stacking_classifier,X_val,y_val)
print("Validating performance:\n",stacking_classifier_model_val_perf)
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)
# Creating Training confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_train,y_train)
# Creating Validation confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_val,y_val)
# Creating Test confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.971 0.985 0.854 0.915
Validating performance:
Accuracy Recall Precision F1
0 0.943 0.967 0.752 0.846
Testing performance:
Accuracy Recall Precision F1
0 0.952 0.943 0.796 0.863
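One reason the stacking model generalizes reasonably here: `StackingClassifier` fits the final estimator on out-of-fold predictions of the base estimators (5-fold CV by default), rather than on their in-sample fits. A self-contained toy example, unrelated to the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the churn features
X, y = make_classification(n_samples=300, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("forest", RandomForestClassifier(n_estimators=25, random_state=1)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-learner is fit on out-of-fold base predictions
)
stack.fit(X, y)
print(stack.score(X, y))
```

This out-of-fold scheme is why the stacked model's training metrics above sit below the base models' near-perfect training recall while its test metrics hold up.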
feature_names = X_train.columns
importances = tuned_xgb.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()